Causal Discovery and Optimal Experimental Design for Genome-Scale Biological Network Recovery

Causal discovery of genome-scale networks is important for identifying pathways from genes to observable traits –e.g. differences in cell function, disease, drug resistance and others. Causal learners based on graphical models rely on interventional samples to orient edges in the network. However, these models have not been shown to scale up the size of the genome, which are on the order of 103-104 genes. We introduce a new learner, SP-GIES, that jointly learns from interventional and observational datasets and achieves almost 4x speedup against an existing learner for 1,000 node networks. SP-GIES achieves an AUC-PR score of 0.91 on 1,000 node networks, and scales up to 2,000 node networks – this is 4x larger than existing works. We also show how SP-GIES improves downstream optimal experimental design strategies for selecting interventional experiments to perform on the system. This is an important step forward in realizing causal discovery at scale via autonomous experimental design.


INTRODUCTION
A biological network describes causal interactions between variables in a biological system. These variables can be genes, transcription factors, proteins and metabolites. Interactions between these variables, along with external environmental factors, maps the genome of a species (or genotype) to an observable trait (or phenotype) (See Figure 1). Predicting phenotypes from genotypes is one of the core challenges in systems biology ( [22], [25], [16]). Genotype-phenotype models are useful for predicting cell function, disease, drug resistance, and many other problems related to biology and public health. In this paper we focus on biological network recovery of the first layer in the interaction hierarchy (the gene regulatory network) -which we name a "genome-scale network" since the number of variables corresponds to the number of genes in a genome. Recovery of genome-scale networks is important for understanding drivers for phenotypic changes and for identifying new drug targets. Moreover, a better understanding of genome-scale networks enables more accurate downstream phenotype prediction models including those that use machine learning (See latent topics in Figure 1).
Biological networks are inferred from experimental data. Recovery of biological networks is a reverse engineering problem: given experimental data, what is the network that gives rise to the data? Existing methods for network recovery fall into two primary categories: pairwise information theoretic methods, and graphical model methods. The advantages of the former are that they are embarrassingly parallel and have been shown to be effective on large-scale biological datasets ( [7], [14], [18]). The main disadvantage of these methods is that they can only learn from one data distribution -e.g. they cannot incorporate new experimental data after an intervention has been applied to a system. The second category of network recovery methods are graphical models, often called structure learners. The advantages of these are that they have an immediate causal interpretation (the direction of an edge implies cause and effect) and as a result a suite of methods have been built to jointly learn from observational and interventional data distributions ( [10], [31], [29]) -we call these "joint learners". This ability to jointly learn from different distributions takes a statistical model (e.g a Bayesian Network) and converts it into a causal model (e.g a Causal Bayesian Network). The disadvantages of graphical models are that they do not scale well with the number of nodes in the graph (the graph search space is combinatorial), and joint learners do not have parallel implementations. This means that joint structure learners have not been evaluated on datasets at scale -at most we observed evaluations on 500 node networks. Given that the genomes of most species are on the order of 10 3 -10 4 genes, there is a need to scale up these joint structure learners to larger networks. In this paper we introduce a new hybrid structure learner, named SP-GIES 1 , that leverages the advantages of both categories of methods.
A causal model is preferable to a statistical model because directed causal edges in a network allow us to identify pathways, or mechanisms of action, from genotype to phenotype. In applications of biology and medicine, researchers seek to model both functional outcomes and the mechanisms leading to outcomes so that these can be understood and manipulated through therapeutics. Given this priority, there is a demand for causal models in this application space. This motivates the need to incorporate data from interventions on the system. We estimate the space of possible interventions Figure 1: The hierarchy of biological networks which is partially known. Levels in this hierarchy can be represented as causal networks. Interventions activate mechanisms of action and reveal causal relationships within a mechanism. The space of possible interventions is large, motivating the need for autonomous design loops like AI driven experimental design and robotic laboratories. In theory, recovery of all networks allows for full genotype to phenotype mapping. However, in practice it is impossible to fully capture the data distribution and control for confounding variables. This motivates representing mechanisms of action as topic latent variables that can be learned via topic modeling.
in the environment and gene spaces to be 10 4 -10 5 depending on the species under investigation. A brute force sampling of this space of interventions is experimentally expensive and wasteful. Instead, there is a need to design feedback loops that choose advantageous interventions based on domain knowledge and learned knowledge. Work in optimal experimental design (OED) uses expected gain to choose interventions on variables in Bayesian Networks to drive causal discovery ( [29], [11], [1], [8]). This paradigm allows us to reason about how new inferred knowledge can connect to our existing domain knowledge, and provide a path forward for integrating AI into science domains.
The contributions of this work are as follows: (1) The implementation of a new hybrid structure learning method named Skeleton-Primed Greedy Interventional Equivalence Search (SP-GIES), based on the existing method GIES, that jointly learns from observational and interventional data. SP-GIES achieves better network recovery accuracy and a faster time to solution on large-scale networks compared to existing methods. (2) The application of SP-GIES to an optimal experimental design feedback loop that chooses optimal interventions on genes. (3) An analysis of the complexity of these methods and discussion of future directions, including unified libraries of causal algorithms, in order to achieve genome-scale network recovery for genotype to phenotype mapping with interventions.

PRELIMINARIES 2.1 Causal Bayesian Networks
Let = ( , ) be an acyclic graph defined by a set of vertices and directed edges . The vertices of the graph represent random variables 1 ... . Under the Markov Assumption for Bayesian Networks, each variable is conditionally independent of its nondescendants given its parents. The joint distribution of a Bayesian Figure 2: How to convert from purely statistical model (such as a Bayesian Network) to causal model (such as a Causal Bayesian Network) with interventional data. By modeling interventional distributions, the interventional essential graph ( ) is closer to the true underlying causal graph compared to the essential graph ( ). This figure is adapted from Schölkopf et al. [27] network factorizes as (X) = =1 ( | ( )) [28]. The following definitions describe important properties of Bayesian Networks that are relevant for structure learning.  ). An essential graph Ess(G) is a partially oriented graph that uniquely represents the MEC of a DAG. Directed arrows only exist on edges consistent across the equivalence class and all other edges are left undirected [2].
A Causal Bayesian Network is a Bayesian Network where edges represent causal and effect relationships [20]. For example if a directed edge exists from to , this means that no unobserved confounding variables are responsible for the their correlation and is a direct cause of .

Structure Learning with Observational Data
Constructing the graph structure from a set of instances (sampled values for each node ) is called structure learning. The following assumptions are made for learning a graphˆfrom data.
(1) Causal sufficiency -All random variables are observed, i.e. there are no hidden variables (2) Causal Markov Assumption -The data is generated from an underlying Bayesian Network ( * , * ) over a set of random variables (3) Faithfulness Assumption -The distribution * over induced by ( * , * ) satisfies no independencies beyond those implied by the structure of * We are given a data set = { 1 ... } of samples from * -this data is assumed to be independent and identically distributed (iid). The task is to learn a modelˆ= (ˆ,ˆ) that defines a distribution that best fits true distribution * [13].
All graphs in the same MEC give rise to the same observational distribution. Therefore the best we can do with only observational data is recover the true graph's MEC. Several graphical model based methods learn the structure from observational data only. These are split into constraint-based and score-based methods. There are also pairwise information theoretic methods that learn from only observational data. We discuss these in Section 3.

Structure Learning with Interventional Data
In addition to structure learning on observational data, we want to incorporate interventional datasets into our learning procedures.
Intervening on a set of random variables ⊆ removes incoming edges to the random variables , and sets the joint distribution to a new interventional distribution ( = ) . Here, we are setting node to a value . In this paper, we assume a "hard" interventions, where a node is set to a constant value, because this mimics the types of interventions we are able to perform in biology applications -e.g. gene knockout experiments. This is contrast to a "soft" intervention which is samples = from a distribution.

Definition 2.3 (Interventional Distribution).
The joint distribution after a hard intervention is: Jointly learning on observational and interventional data allows us to correctly isolate causal relationships and orient edges in the graph. An estimated ( ) from observational data can be further refined into an I-essential graph ( ( )). This is a partially oriented DAG that represents an interventional Markov Equivalence class (I-MEC) (See Figure 2). There are several existing methods that jointly estimate structure given observational and interventional data. See Vowels et al. [30] for a full list of these structure learners. Table 1: Properties of all the learners covered in this paper. Methods are split row-wise by the nature of the learner. Data type refers to observational and/or interventional data capability of the learners. Evaluation dataset lists the benchmarks used for evaluation for each paper. Finally, the table also lists the maximum number of nodes the algorithm was evaluated on and the worst case complexity of the algorithm as reported in the corresponding papers. is the number of nodes, is the number of samples. is the maximal degree of any node in the graph. Note that for joint learners the average case complexity is not exponential, however exact calculations depend on clique size and network topology.

Data Type
Evaluation dataset Scaling Paper Observ. Interv. Random DREAM Large-Scale Max # of nodes Worst case runtime CLR [7] ✓ ✓ 4,345 [15], [17] ✓ Table 2: Properties of the OED methods covered in the paper. OED criteria refers to the utility function used for selection of next experiments. Note that the the complexity listed here corresponds only to choosing the next intervention or sets of interventions. Each algorithm here also has the cost of sampling from the posterior distribution which is equivalent to the complexity of the learners in Table 1 OED criteria Evaluation dataset Scaling Paper Info. theory Edge orient. Random DREAM Large-Scale Max # of nodes Worst case runtime Ness et al. [19] ✓ ✓ 17 (| | * !) Hauser and Bühlmann [11] ✓ 40 polynomial in Tong and Koller [29] ✓ 12

Optimal Experimental Design
We have access to new interventional data by sampling the system, but we wish prioritize the interventional experiments that will recover the true causal graph the fastest. This is done via maximization of an expected utility function, , over the set of potential interventions:ˆ= arg max E | [ ] where is a utility function like mutual information, number of oriented edges, etc. Our goal is to recover the underlying Causal Bayesian Network given a mix of interventional and observational samples, and some prior information about the causal edges in the graph. We cast this as a Bayesian Inference problem over the space of DAGs. We have a prior ( ) which encodes any prior structural knowledge about the underlying DAG. Applying Bayes Theorem gives us the posterior distribution ( | ) ∝ ( | ) ( ). The likelihood is where we have marginalized out the parameters of the graph [29]. is a mix of observational and interventional data. The posterior distribution is sampled via the joint structure learners described in the previous section. After sampling new data from the chosen intervention using OED, the new data is concatenated to the existing dataset and the posterior is re-sampled using the structure learners. Table 1 and Table 2 provide summary of the related work in biological network recovery and optimal experimental design respectively.

Structure Learning With Interventional Data for Biology Network Recovery
Pairwise information theoretic learners use calculated metrics to estimate the dependence of two variables. A threshold is applied to the metric to obtain a network representation of the system. Pinna et al. [23] use a deviation matrix (difference between intervened samples and steady state samples) to estimate an initial network. This algorithm won the DREAM4 network recovery challenge and was the state-of-the-art for gene regulatory network reconstruction at the time. Faith et al. [7] introduce CLR (context likelihood of relatedness) which calculates the pairwise B-spline mutual information between genes in the E. coli K-12 gene regulatory network. CLR is implemented in MATLAB, with a fast parallel kernel for calculating the pairwise mutual information. Margolin et al. [18] developed ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) which is a similar pairwise mutual information based network recovery algorithm. ARACNE additionally employs a DPI (data processing inequality) algorithm that removes indirect interactions. ARACNE was more recently upgraded to ARACNE-AP [14], which is implemented in Java, to handle adaptive partitioning to calculate the mutual information -this achieved a 200x computational performance improvement. Graphical model based learners reveal a much finer structure than pairwise methods because they are based on Bayesian Networks. The PC algorithm is a constraint-based structure learner that only handles observational data. PC learns a graph by performing conditional independence testing on pairs of nodes conditioned on other nodes. Since its introduction, many parallel implementations and optimizations of the PC algorithm have been added ( [15], [17]) including one that uses GPUs ( [34]). Existing structure learners that jointly learn from observational and interventional data (joint learners) are either score-based or hybrid score/constraint-based methods. Rau et al. [24] propose an algorithm called MCMC Mallows based on Markov Chain Monte Carlo sampling over the space of potential Gaussian DAGs and optimizes the best scoring DAG. MCMC Mallows was evaluated on the DREAM4 networks, and scored better than Pinna et al. [23] on two of the networks. Hauser and Bühlmann [10] proposed Greedy Interventional Equivalence Search (GIES) which is a greedy score-based approach that is an extension of the original GES learner. The worst case complexity of GIES is exponential in (the number of nodes in the graph), however the authors note and show empirically that GIES is more efficient than this worst case. GIES was evaluated on DREAM4 and achieved scores in the top third of all participants. Intervention Greedy Sparsest Permutation (IGSP, Wang et al. [31]) is a hybrid score/constraint, and non-parametric extension of the GSP learner. IGSP associates a DAG to every permutation of random variables and greedily updates the DAG by transposing elements of the permutation. IGSP performed comparably to GIES on protein signaling data [26] and better than GIES on a real gene expression dataset with 24 genes. The worst case complexity of IGSP is also exponential in . IGSP and GIES can be implemented with polynomial complexity by restricting the maximum degree of each node in the graph, however, for large scale networks with hubs (nodes with high connectivity) the maximum degree may be hard to determine. We show the empirical scaling of these joint learners against our proposed learner, SP-GIES, in Section 4.

Optimal Experimental Design for Biology Network Recovery
One way to choose the optimal intervention is to choose the set of interventions that maximizes the mutual information or leads to the greatest decrease in entropy. Tong and Koller [29] sample a set of orderings from the distribution over graphs and parameters. They then use this set of orderings to compute entropy terms and select the intervention with the lowest expected posterior loss. Agrawal et al. [1] extend the work of Tong and Koller [29] with the ABCD strategy -a greedy implementation of batched experimental design. They use a utility function based on the expected entropy decrease of an intervention. This requires calculating expectations over the graph and parameter spaces, however, they are able to make a tractable algorithm by using bootstrapping and weighted importance sampling. Another way to choose the optimal intervention is to choose the intervention that leads to the maximal number of oriented edges. Ness et al. [19] use optimal experimental design to recover protein signaling networks [26]. They use a utility function based on the expected information gain of an intervention given the observational MEC and other interventions in the batch. This algorithm, however, has factorial dependence on batch size. Ghassami et al. [8] use the expected number of oriented edges of an essential graph as the utility function. The essential graph of the ground truth network is first estimated using a constraint based method like the PC algorithm. Hauser and Bühlmann [11] similarly propose a utility function based on the number of oriented edges of a skeleton graph.

METHOD
Here we describe our hybrid algorithm Skeleton-Primed Greedy Interventional Equivalence Search (SP-GIES). A skeleton is an undirected graph with edges that correspond to edges in a DAG. The algorithm is a simple sequential use of an observational learner to estimate a skeleton, and the joint graphical model structure learner GIES to orient edges. The two step algorithm is as follows: (1) Use ARACNE, CLR or PC to generate a skeleton with only observational samples (2) Use the output of (1) to restrict the possible edge set for the GIES learner. Jointly learn from observational and interventional data using GIES.
If the PC algorithm is used then the input to (2) is an Ess(G) which is represented by a partially oriented DAG or PDAG. We chose GIES for (2) since it has an open source implementation that is part of the widely used pcalg library in R. For scaling studies, we chose to use the R implementation of PC algorithm. This is because CLR and ARACNE are implemented in MATLAB and Java respectively. Since GIES only has an R implementation, we chose to use the R implementation of PC for our scaling studies. Figure 3: A weak scaling study comparing the SP-GIES, GIES and IGSP algorithms. The number of samples is fixed to =1,000. Data points are averaged over 3 runs. The study was performed on a 2.8 Ghz AMD EPYC Milan 7543P 32 core CPU and a NVIDIA A100 GPU. The IGSP 1,000 and 2,000 node runs exceeded our quota and is not plotted here.
While (1) is easily parallelizable on multiple processors or deployable to a GPU, the bottleneck of the computation is in the GIES step (see Figure 4). Still in Figure 3, we see that SP-GIES is able to reach a faster time to solution than GIES. To understand why this is, we investigated the GIES algorithm. GIES is a greedy score-based structure learner. The score function is the Bayesian Information Table 3: Evaluation of learners on random networks of size 10. Three different types of random networks were generated: Erdös Renyi [6], Scale free [4], and Small world [32]. and refer to parameters used to generate the graphs, not the number of nodes and degree as used previously. Results are averaged over 30 random graphs, each learned with 100 data samples. The superscript on the algorithms refers to the data type used (O for observational, OI for mixed observational and interventional). For SP-GIES, ARACNE, CLR, and then PC were used for skeleton estimation for Erdös Renyi, Scale Free and Small World respectively since these were the best performers.  Step (1) is cuPC and

Random
Step (2) is GIES. We see that as the graph size increases, the bottleneck is the GIES step.
Criteria which is based on the maximum likelihood estimate of the graph given the current data. Starting from an empty essential graph, in the forward phase of the algorithm an edge is added to ( ) such that the new essential graph has the maximal score. The algorithm iterates through all possible edges until it finds the corresponding essential graph with the highest score. The backward phase is similar, except an edge is removed instead of added. By restricting the set of possible edges to the GIES algorithm by first estimating the skeleton, we narrow the size of the search space for both phases -this is what results in the speedup. Moreover, since we use a fast GPU implementation of the PC algorithm (Zarebavani et al. [34]), the overhead of estimating the skeleton first is minimal. Table 1 shows that all joint learners have a worst case exponential complexity -this is attributed to the combinatorial search space over graphs. However, the average empirical complexity is related to the topology of the network i.e, the size of the largest clique. Since the average complexity is difficult to calculate, we show in Figure 3 that SP-GIES reaches a faster time to solution as the size of the network increases. We ran a scaling study for SP-GIES using a similar study as Hauser and Bühlmann [10] except we were able to scale up to =2,000 nodes versus the =500 nodes in the GIES paper. IGSP was unable to complete for the 1,000 and 2,000 node nodes within our allotted quota of 72 hours. We see that SP-GIES is able to reach a faster time to solution than both IGSP and GIES because of the restriction of the edge set discussed previously.

DATASETS
We tested SP-GIES against existing learners on datasets at various scales. Random networks were generated using the networkx Python library. Observational data was generated assuming a linear Gaussian model. We generated 100 observational samples and one interventional sample per node. The DREAM4 insilico challenge provides observational and interventional gene knockout data of gene regulatory networks synthetically generated from stochastic ODEs. The dataset contains 10 observational samples and one interventional sample per node. The ground truth networks are subnetworks of the larger E.coli gene regulatory network. The Reg-ulonDB dataset was provided by Faith et al. [7]. The network is the current estimated E.coli K12 gene regulatory network and contains 1146 nodes, 3179 edges and 524 real experimental samples. Of the 524 experimental samples, 35 are samples are taken from gene knockout experiments. The RegulonDB dataset also contains environmental interventions. Existing learners currently only model gene variables, therefore any environment interventional samples were treated as observational data. Future work includes explicitly modeling environmental condition variables.
Each learner was evaluated on three metrics: Structural Hamming Distance (SHD), Structural Interventional Distance (SID) and the AUC-PR. The SHD is the L1 error between the generated DAG Table 4: Evaluation of learners on the DREAM4 networks of size 10. The dataset contains 11 observational samples and 10 interventional samples. Interventional samples are gene-knockout experiments. For SP-GIES, ARACNE was used for skeleton estimation, except for Network 4 which used CLR. Note that for Network 3, we used adaptive learning in the GIES subroutine (adaptive="triples") to achieve the best performance for the SP-GIES learner. We did not see significant improvement with adaptive learning for the other networks.  The AUC-PR is the area under the precision-recall curve. All three of these metrics are common evaluations for causal discovery. However, we note that SHD and AUC-PR are biased towards empty graphs. Suppose a structure learner learns a graph with no edges and with = | |. Another learner learns a graph with > | |, but correctly learns some of the causal edges in the true graph. According to the SHD, is better and we may posit that is the better learner for the data. However, an argument can be made for because it captures more causal relationships than . This means that a lower SHD may not always indicate a model that best fits the data distribution. A similar argument can be made for the AUC-PR. The drawback of the SID is the computational complexity is quadratic in the number of nodes for sparse networks, and cubic in the number of nodes for dense networks. This is shown empirically up to 50 nodes by Peters and Bühlmann [21].

RESULTS
Results of our evaluation on random networks, DREAM4 networks, and genome-scale networks are shown in Table 3, Table 4, and Table 5 respectively. The observational learners we evaluated are PC, ARACNE, and CLR. The interventional learners we evaluated are GIES, IGSP, Pinna and SP-GIES. We did not include MCMC Mallows in our evaluation because of the scalability challenges associated with Monte Carlo sampling. See Rau et al. [24] for an evaluation of MCMC Mallows on small DREAM4 and random datasets. As a reference, we also include the scores for a network without any dependencies (no edges), named NULL. For CLR and ARACNE, a threshold value must be chosen to filter edges. For CLR with the random and DREAM4 networks, we chose values so that approximately the top 10% of edges were retained. For CLR with the RegulonDB dataset, we followed the work of Faith et al. [7] and chose the threshold that resulted in 60% precision. For ARACNE, we used the threshold associated with a p-value of 10 −8 -this is the default used by Lachmann et al. [14].
For random networks of size 10, the best performing learner depends on network type. GES with no interventional data performs the best on scale free networks. SP-GIES performs best on small world networks. Both GIES and SP-GIES perform comparably for Erdös Renyi networks. For DREAM4, SP-GIES most frequently gets the best score across all metric types over all five networks. We see that the addition of interventional data improves the performance of algorithms that can handle both data types (namely GIES, and SP-GIES).
For the RegulonDB network, the best scoring method is CLR. This is because the CLR method calculates a threshold value to prune edges so that the learner achieves 60% precision compared to the ground truth network -this makes it an unfair comparison since the ground truth network must be known a priori. We used CLR for the skeleton estimation of SP-GIES. An unexpected result is that SP-GIES appears to worsen the initial estimate of the CLR skeleton. To understand this, we zoomed in on a subnetwork of RegulonDB that contains three hubs (highly connected nodes). Figure 5 shows that CLR correctly identifies 3 hubs in the subnetwork. However, SP-GIES removes many edges from this initial estimate and adds new edges elsewhere. As a result, no hubs are detected using SP-GIES. One possible reason is that the GIES learner uses the maximum likelihood estimate to score each candidate graph. This calculation assumes the data is sampled from a linear Gaussian distribution -which is not the case for real datasets like RegulonDB where nonlinear effects may be present. Since mutual information can capture nonlinearity, we do not see this issue reflected in CLR. To test this theory, we generated synthetic linear Gaussian data from the RegulonDB subnetwork topology. Figure 5 shows that with synthetic data the SP-GIES can detect 3 hubs in the subnetwork. The nonlinearity of the dataset is only a partial explanation because the SP-GIES estimate with synthetic data is still further from the ground truth compared to the CLR estimate. Further analysis is needed for understanding the data regimes and network topologies within the scope of joint learners like SP-GIES.
For a few of the networks we evaluated here, the NULL graph achieves the best score. Many real world networks, including gene regulatory networks, are sparse. As a result, the NULL learner often appears to perform very well since the true networks lie close to an empty network in combinatorial space. Interestingly, the NULL graph never scores best for the SID metric. This could indicate that causal network recovery methods should be evaluated with the SID rather than other metrics. However, more analysis is needed to understand the validity of this and is outside the scope of this paper.
To understand the performance gains of SP-GIES over GIES on optimal experimental design, we evaluated both algorithms with two different OED criteria on the DREAM4 insilico 10 node dataset. We limited this evaluation to small networks since the time complexity of OED includes posterior sampling and optimal gene selection for each round of intervention. We follow the framework of [1], which includes bootstrap sampling of the posterior distribution. An example result for a DREAM4 insilico network is shown in Figure 6. We observe three trends here. First, the edge orientation strategy achieves better performance than the information gain strategy. Second, the SP-GIES learner boosts the initial performance of the learner, however, it easily stagnates and additional interventional samples do not continue to improve the algorithm at the same rate as the GIES based strategies. Still the SP-GIES x OED runs achieve better performance than GIES x OED, and there is some improvement in accuracy with the addition of interventional data. We speculate that stagnation may happen with the SP-GIES joint learner because by fixing the skeleton, we also restrict the search space of the learner. This may lead to the GIES step of the algorithm The estimated undirected subnetwork given by CLR, which correctly identifies the 3 hubs. The threshold was chosen so that 60% of the ground truth network was recovered. (C) The subnetwork estimated by SP-GIES using the CLR skeleton. With real data, no hubs are detected. (D) The subnetwork estimated by SP-GIES using the CLR skeleton with synthetic data. The 3 hubs are detected, although the graph is much more sparse than the CLR estimate and ground truth network.
getting stuck in a local minima. This motivates incorporating some level of uncertainty into the skeleton, or randomly including sets of edges outside the skeleton into the search space -we leave this to future work. Third, we see that the random strategy is comparable to the OED strategies. In other words, prioritizing certain experiments over others does not necessarily result a better estimate of the causal network. One possible explanation is that the network topology may be such that intervening on certain nodes is not advantageous compared to others. However, another reason is that the OED strategy relies on a good model (causal graph) to estimate the posterior distribution. Without a good model, the calculation of the utility function will be inaccurate and the wrong experiments may be prioritized. Although SP-GIES provides us with a better estimate of the causal graph compared to GIES, it may not be sufficient for OED. This is analogous to what we see in active learning; when the model performance is poor, a random labeling of new samples is effective for model improvement.

DISCUSSION
Understanding genome-scale networks is important for building a causal mapping from genotype to phenotype. However, recovery of genome-scale networks from interventional datasets has not yet been realized because of the computational complexity of structure learners that jointly learn from mixed data. Existing methods like GIES, IGSP, MCMC Mallows and others are only evaluated on small networks up to 500 nodes and exhibit worst case exponential complexity. This makes subsequent optimal experimental design techniques computationally intractable since the complexity is then multiplied by the computation of choosing the optimal intervention. This is a huge issue in realizing causal discovery at scale via autonomous design loops.
In this paper, we note that structure learners built for observational datasets have parallel implementations on multiple processors and on GPUs. Our SP-GIES learner first estimates a skeleton using these parallel implementations. This restricts the edge set for the subsequent joint structure learner and improves the time to solution. However, SP-GIES is still bottlenecked by the scaling of the joint learner GIES -for example running network sizes of 10 4 nodes did not complete in 24 hours. To realize recovery of genome-scale networks on the order of 10 3 -10 4 , and to sample an interventional space on the order 10 4 -10 5 , we must have breakthroughs in distributed parallel implementation of joint learners. Joint structure learners currently do not have parallel implementations because they are score-based learners (or hybrid constraint/score-based learners). At each step of the algorithm the candidate graph is scored, typically using a function of the likelihood ( | ) over the entire graph. As we saw in Section 2, the joint distribution of a DAG is a product of conditional probabilities which, for each node, depends on the parent nodes. As a result, the calculation of the likelihood is sequential and typically optimization over this space is done greedily. It is not intuitive how the graph may be partitioned into subgraphs for individual compute kernels. Constraint based methods, on the other hand, use pairwise conditional independence testing to resolve edges. A naive approach to scaling is to generate a separate compute thread per calculation, however, one can group variables into blocks to further minimize computations and boost performance. This type of performance enhancement has not yet been realized by score-based learners, and this presents a roadblock for existing joint learners.
The performance of SP-GIES on the RegulonDB dataset -which is the most biologically relevant dataset -suggests that future work should also incorporate nonlinearity into joint learners. One approach is to modify the GIES score function. Since closed form solutions for the maximum likelihood estimate only exist for a small set of functions, an alternative is to use a nonparametric learner like IGSP. However, IGSP does scale to the size of the RegulonDB dataset. Moreover, the IGSP learner in general appears to perform poorly on random and DREAM4 networks. We note that the use case for this algorithm is considerably different than the datasets used here. IGSP was designed for cases where there are sufficient interventional data samples present in order for hypothesis testing to be successful. For the datasets used here, only a handful of interventional samples are available for each node. Structure learning for nonlinear datasets has been studied in works such as Gretton et al. [9], Yu et al. [33], Zheng et al. [35]. Recent works use neural networks by recasting the combinatorial optimization into a continuous optimization. However, these methods suffer in low data regimes. In order to realize network recovery at scale for real biological datasets, there is a need to incorporate strategies from nonlinear structure learners into joint learners.
In evaluating existing methods and designing the SP-GIES method, we found that learners and OED strategies are implemented in a varied set of languages including Java, MATLAB, R, Python and C. This is due to the diverse nature of backgrounds interested in causal discovery of biological networks including from biology, statistics, computer science, machine learning, and representation learning. There has been a recent interest in unifying tools for causal discovery -for example CausalDiscoveryToolbox [12] , Pgmpy [3] and CausalDAG [5] are Python libraries that provide varying support for graphical and/or causal modeling. However, in the case of CausalDiscoveryToolbox, the library is a Python wrapper for R code, which in turn is a wrapper for C code. For Pgmpy, no joint interventional learners are implemented. CausalDAG provides the best overall framework for causal discovery with interventions, although many of the implementations are still incomplete. To realize better parallel implementations of causal algorithms and OED strategies, we should encourage collaborations with the supercomputing community. Therefore, there is a need for exclusive Python or C implementations of these models and algorithms since these languages are popular in the supercomputing community.

CONCLUSION
We present SP-GIES, a joint structure learner that leverages parallel observational learners to estimate a skeleton and initialize the GIES joint learner. We provide a systematic evaluation of SP-GIES against existing methods for datasets at various scales and with various metrics. We see that SP-GIES is able to provide better network recovery accuracy for larger scale networks up to 2,000 nodes and on biological networks up to 1,146 nodes. This scale of network recovery for joint learners has not been achieved to our knowledge. SP-GIES reaches a faster time to solution than existing method GIES and provides up to 3.87x speedup. We also show that SP-GIES improves subsequent OED strategies. Future work includes a distributed parallel implementation of SP-GIES, and incorporation of SP-GIES into an autonomous experimentation design loop.